High-Dimensional Statistical Process Control via Manifold Fitting and Learning

Tas, Burak I., del Castillo, Enrique

arXiv.org Machine Learning

We address the Statistical Process Control (SPC) of high-dimensional, dynamic industrial processes from two complementary perspectives: manifold fitting and manifold learning, both of which assume the data lies on an underlying nonlinear, lower-dimensional space. We propose two distinct monitoring frameworks for online, or 'phase II', SPC. The first method leverages state-of-the-art techniques in manifold fitting to accurately approximate the manifold where the data resides within the ambient high-dimensional space. It then monitors deviations from this manifold using a novel scalar distribution-free control chart. In contrast, the second method adopts a more traditional approach, akin to those used in linear dimensionality-reduction SPC techniques, by first embedding the data into a lower-dimensional space before monitoring the embedded observations. We prove that both methods provide a controllable Type I error probability, after which they are contrasted for their corresponding fault-detection ability. Extensive numerical experiments on a synthetic process and on a replicated Tennessee Eastman Process show that the conceptually simpler manifold-fitting approach achieves performance competitive with, and sometimes superior to, the more classical lower-dimensional manifold-monitoring methods. In addition, we demonstrate the practical applicability of the proposed manifold-fitting approach by successfully detecting surface anomalies in a real image dataset of electrical commutators.


Adaptive Linear Embedding for Nonstationary High-Dimensional Optimization

Wen, Yuejiang, Franzon, Paul D.

arXiv.org Machine Learning

Bayesian Optimization (BO) in high-dimensional spaces remains fundamentally limited by the curse of dimensionality and the rigidity of global low-dimensional assumptions. While Random EMbedding Bayesian Optimization (REMBO) mitigates this via linear projections into low-dimensional subspaces, it typically assumes a single global embedding and a stationary objective. In this work, we introduce Self-Adaptive embedding REMBO (SA-REMBO), a novel framework that generalizes REMBO to support multiple random Gaussian embeddings, each capturing a different local subspace structure of the high-dimensional objective. An index variable governs the embedding choice and is jointly modeled with the latent optimization variable via a product kernel in a Gaussian Process surrogate. This enables the optimizer to adaptively select embeddings conditioned on location, effectively capturing locally varying effective dimensionality, nonstationarity, and heteroscedasticity in the objective landscape. We theoretically analyze the expressiveness and stability of the index-conditioned product kernel and empirically demonstrate the advantage of our method across synthetic and real-world high-dimensional benchmarks, where traditional REMBO and other low-rank BO methods fail. Our results establish SA-REMBO as a powerful and flexible extension for scalable BO in complex, structured design spaces.


Sustainable Visions: Unsupervised Machine Learning Insights on Global Development Goals

García-Rodríguez, Alberto, Núñez, Matias, Pérez, Miguel Robles, Govezensky, Tzipe, Barrio, Rafael A., Gershenson, Carlos, Kaski, Kimmo K., Tagüeña, Julia

arXiv.org Artificial Intelligence

The United Nations 2030 Agenda for Sustainable Development outlines 17 goals to address global challenges. However, progress has been slower than expected and, consequently, there is a need to investigate the reasons behind this. In this study, we used a novel data-driven methodology to analyze data from 107 countries (2000-2022) using unsupervised machine learning techniques. Our analysis reveals strong positive and negative correlations between certain SDGs. The findings show that progress toward the SDGs is heavily influenced by geographical, cultural and socioeconomic factors, with no country on track to achieve all goals by 2030. This highlights the need for a region-specific, systemic approach to sustainable development that acknowledges the complex interdependencies of the goals and the diverse capacities of nations. Our approach provides a robust framework for developing efficient and data-informed strategies to promote cooperative and targeted initiatives for sustainable progress.


Uncovering Hidden Meaning: A Beginner's Guide to Latent Semantic Analysis

#artificialintelligence

If you have ever worked with text data, you have likely encountered the challenge of dealing with high-dimensional and sparse data. One popular solution to this problem is latent semantic analysis (LSA), also known as latent semantic indexing (LSI). LSA is a technique for extracting latent (hidden) semantics from a collection of documents or text data. It does this by mapping the documents into a lower-dimensional space, where the relationships between the documents and the underlying concepts they represent can be more easily understood. One of the key benefits of LSA is that it can handle large amounts of data efficiently and is robust to noise and sparse data.
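The core of LSA described above is a truncated SVD of a document-term matrix. A minimal numpy sketch on a made-up toy corpus (the count matrix and topic labels are hypothetical, chosen only to illustrate the mechanics):

```python
import numpy as np

# Tiny document-term count matrix: rows = documents, columns = terms.
# (Hypothetical toy corpus; any bag-of-words matrix works the same way.)
X = np.array([
    [2, 1, 0, 0],   # doc on topic A
    [1, 2, 0, 0],   # doc on topic A
    [0, 0, 1, 3],   # doc on topic B
    [0, 0, 3, 1],   # doc on topic B
], dtype=float)

# LSA = truncated SVD of the document-term matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                               # number of latent "topics" to keep
doc_embed = U[:, :k] * s[:k]        # documents in the k-dim latent space

# Documents on the same topic end up close together in latent space.
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(doc_embed[0], doc_embed[1]))  # same-topic pair: near 1
print(cos(doc_embed[0], doc_embed[2]))  # cross-topic pair: near 0
```

Real LSA pipelines typically apply TF-IDF weighting before the SVD, but the dimensionality-reduction step itself is exactly this truncation.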


Uncovering the Essence of Principal Component Analysis: A Comprehensive Guide

#artificialintelligence

Principal component analysis (PCA) is a popular statistical technique for reducing the dimensionality of a dataset while preserving important patterns and relationships in the data. At its core, PCA is a linear transformation method that projects the data onto a lower-dimensional space, revealing the underlying structure of the data. But what exactly is PCA and how does it work? In this article, we'll delve into the fundamentals of PCA and explore its applications in a variety of fields, including machine learning, data visualization, and image processing. We'll also discuss some of the key challenges and limitations of using PCA, and provide practical tips for implementing it in your own analyses.
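The linear projection described above can be sketched in a few lines of numpy: center the data and take the top right-singular vectors of the centered matrix. The dataset here is synthetic (three features driven by one shared factor), chosen only to show the projection at work:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 samples of 3 highly correlated features (toy data; any numeric matrix works).
z = rng.normal(size=(200, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])

# PCA: center the data, then take the top right-singular vectors of X.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1
scores = Xc @ Vt[:k].T            # projection onto the first principal axis
explained = s**2 / np.sum(s**2)   # fraction of variance per component

print(scores.shape)   # (200, 1)
print(explained[0])   # close to 1: one direction carries ~all the variance
```

Because the three features are nearly copies of one underlying factor, a single principal component preserves almost all of the structure.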


A Beginner's Guide to Principal Component Analysis

#artificialintelligence

In principal component analysis, a principal component is a new feature that is constructed from a linear combination of the original features in a dataset. The principal components are ordered such that the first principal component has the highest possible variance (i.e., the greatest amount of spread or dispersion in the data), and each subsequent component in turn has the highest variance possible under the constraint that it is orthogonal (i.e., uncorrelated) to the previous components. The idea behind PCA is to reduce the dimensionality of a dataset by projecting the data onto a lower-dimensional space, while still preserving as much of the variance in the data as possible. This is done by selecting a smaller number of principal components that capture the most important information in the data, and discarding the remaining, less important components. In this way, PCA can be used to identify patterns and relationships in high-dimensional data, and to visualize data in a lower-dimensional space for easier interpretation.
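The two properties stated above, decreasing variance and mutual orthogonality of the components, can be checked directly with numpy on synthetic data (the feature scales below are arbitrary, picked so the components have clearly different variances):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: 500 samples, 4 independent features with different spreads.
X = rng.normal(size=(500, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var = s**2 / (len(X) - 1)       # variance captured by each component

# Components are ordered by decreasing variance...
print(np.all(np.diff(var) <= 0))          # True
# ...and the component directions are mutually orthogonal.
print(np.allclose(Vt @ Vt.T, np.eye(4)))  # True

# Keeping the top 2 of 4 components still retains most of the variance.
print(var[:2].sum() / var.sum())
```

Discarding the trailing components is exactly the "keep the important, drop the unimportant" step the paragraph describes.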


ML

#artificialintelligence

T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Dimensionality reduction is the technique of representing n-dimensional data (multidimensional data with many features) in 2 or 3 dimensions. As an example, consider a classification problem, such as predicting whether a student will play football or not, that relies on both temperature and humidity: since the two features are highly correlated, they can be collapsed into a single underlying feature. Hence, we can reduce the number of features in such problems. A 3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a simple 2-dimensional space and a 1-D problem to a simple line.
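The temperature/humidity example above can be sketched numerically. Since collapsing two linearly correlated features is a linear operation, the sketch uses a PCA-style projection rather than t-SNE itself (whose iterative optimization is too involved for a few lines); the readings below are simulated, not real:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical sensor readings: humidity closely tracks temperature.
temp = rng.normal(25, 5, size=300)
humidity = 0.8 * temp + rng.normal(0, 1, size=300)
X = np.column_stack([temp, humidity])

# Collapse the two correlated features into one underlying feature:
# project onto the first principal direction of the centered data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
combined = Xc @ Vt[0]            # single derived feature per sample

share = s[0]**2 / np.sum(s**2)   # variance kept by the single feature
print(combined.shape)   # (300,)
print(share)            # close to 1: little information is lost
```

The single derived feature could then replace the original pair as an input to the football-or-not classifier.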


Understanding dimensionality reduction in machine learning models

#artificialintelligence

Machine learning algorithms have gained fame for being able to ferret out relevant information from datasets with many features, such as tables with dozens of columns and images with millions of pixels. Thanks to advances in cloud computing, you can often run very large machine learning models without noticing how much computational power works behind the scenes. But every new feature that you add to your problem adds to its complexity, making it harder to solve with machine learning algorithms. Data scientists use dimensionality reduction, a set of techniques that remove redundant and irrelevant features from their machine learning models. Dimensionality reduction slashes the costs of machine learning and sometimes makes it possible to solve complicated problems with simpler models. Machine learning models map features to outcomes.
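The simplest form of the feature-removal idea described above is to drop features that barely vary, since a near-constant column cannot help distinguish outcomes. A small numpy sketch on a made-up dataset (the feature counts and threshold are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
# Toy dataset: 2 informative features plus 8 near-constant, irrelevant ones.
informative = rng.normal(size=(n, 2))
irrelevant = 5.0 + 0.001 * rng.normal(size=(n, 8))
X = np.hstack([informative, irrelevant])

# Drop features whose variance is tiny: they carry almost no information
# about any outcome, so removing them simplifies the model at no cost.
variances = X.var(axis=0)
keep = variances > 0.01
X_reduced = X[:, keep]

print(X.shape, "->", X_reduced.shape)   # (400, 10) -> (400, 2)
```

Projection methods such as PCA go further by combining features rather than just deleting them, but the goal is the same: fewer inputs, simpler and cheaper models.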


Machine learning: What is dimensionality reduction?

#artificialintelligence

Machine learning algorithms have gained fame for being able to ferret out relevant information from datasets with many features, such as tables with dozens of columns and images with millions of pixels. Thanks to advances in cloud computing, you can often run very large machine learning models without noticing how much computational power works behind the scenes. But every new feature that you add to your problem adds to its complexity, making it harder to solve with machine learning algorithms. Data scientists use dimensionality reduction, a set of techniques that remove redundant and irrelevant features from their machine learning models. Dimensionality reduction slashes the costs of machine learning and sometimes makes it possible to solve complicated problems with simpler models.


Framework for Data Preparation Techniques in Machine Learning

#artificialintelligence

There are a vast number of different types of data preparation techniques that could be used on a predictive modeling project. In some cases, the distribution of the data or the requirements of a machine learning model may suggest the data preparation needed, although this is rarely the case given the complexity and high dimensionality of the data, the ever-increasing parade of new machine learning algorithms, and the inevitably limited knowledge of any one practitioner. Instead, data preparation can be treated as another hyperparameter to tune as part of the modeling pipeline. This raises the question of how to know which data preparation methods to consider in the search, which can feel overwhelming to experts and beginners alike. The solution is to think about the vast field of data preparation in a structured way and systematically evaluate data preparation techniques based on their effect on the raw data.
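The "data preparation as a hyperparameter" idea can be sketched as a tiny search loop. Everything below is a hypothetical stand-in: numpy only, a made-up two-feature dataset with wildly different scales, and a simple nearest-centroid model in place of a real pipeline and cross-validation:

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy two-class data where feature scales differ wildly (hypothetical).
n = 200
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    y + 0.5 * rng.normal(size=n),    # informative, small scale
    1000 * rng.normal(size=n),       # pure noise, huge scale
])

# Candidate data preparation techniques to search over.
def identity(A):
    return A

def standardize(A):
    return (A - A.mean(0)) / A.std(0)

def minmax(A):
    return (A - A.min(0)) / (A.max(0) - A.min(0))

def nearest_centroid_acc(A, y):
    # Classify each point by the closer class centroid; return accuracy.
    c0, c1 = A[y == 0].mean(0), A[y == 1].mean(0)
    pred = (np.linalg.norm(A - c1, axis=1)
            < np.linalg.norm(A - c0, axis=1)).astype(int)
    return (pred == y).mean()

# Treat the preparation step itself as a hyperparameter: evaluate the
# model under each candidate transform and keep the best-scoring one.
candidates = {"none": identity, "standardize": standardize, "minmax": minmax}
scores = {name: nearest_centroid_acc(f(X), y) for name, f in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print("best preparation:", best)
```

Here the raw-scale variant fails because the noisy large-scale feature dominates the distance computation, so the search selects a rescaling transform; in practice the same loop would wrap a real pipeline and cross-validated scoring.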